Converting PDFs to Text with R

Author

Martin Schweinberger

Introduction

PDF (Portable Document Format) is one of the most pervasive formats for distributing text — research articles, government reports, historical documents, scanned books, and court judgements are routinely archived as PDFs. For linguistic and computational research, the ability to extract clean, machine-readable text from PDFs is therefore a fundamental data-preparation skill. This tutorial shows how to do that efficiently and reliably in R.

Two complementary packages are covered:

  • pdftools — fast, dependency-light extraction for digitally generated PDFs (PDFs rendered from Word, LaTeX, InDesign, or a web browser, where the text layer is embedded in the file)
  • tesseract — slower but more robust OCR (Optical Character Recognition) for image-based PDFs, scanned documents, faxes, and any PDF where the text is stored as a raster image rather than as selectable characters

The tutorial also covers how to extract document metadata and page-level information with pdftools, how to configure the tesseract engine for different languages, how to handle multi-page scanned PDFs, and how to combine OCR output with automated spell-checking and suggested-correction workflows using hunspell.

Prerequisite Tutorials

Before working through this how-to, familiarity with the following is recommended:

Citation

Schweinberger, Martin. 2026. Converting PDFs to Text with R. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/pdf2txt/pdf2txt.html (Version 2026.02.24).


pdftools vs tesseract: Choosing the Right Tool

Before writing a single line of code, the most important decision is which extraction tool to use. The answer depends entirely on how the PDF was created.

Choosing between pdftools and tesseract

Situation                                                      Recommended tool
PDF rendered from Word, LaTeX, or InDesign                     pdftools
PDF saved from a web browser or exported from software         pdftools
Scanned physical document (book, report, fax)                  tesseract
PDF of a photograph or image                                   tesseract
PDF with embedded fonts but garbled character encoding         tesseract
Mixed PDF (some pages have a text layer, others are scanned)   tesseract for scanned pages; pdftools for text pages
Non-Latin script (Arabic, Chinese, Devanagari, etc.)           tesseract with the appropriate language model

How to tell which type you have: Open the PDF in a PDF viewer and try to select and copy a word. If you can select individual characters and the copied text is legible, the PDF has an embedded text layer and pdftools is the right choice. If selecting text is impossible, produces garbled output, or selects only whole blocks, the PDF is image-based and you need tesseract.

Quick Diagnostic in R
Code
# A quick way to check whether a PDF has a usable text layer:
# if nchar() returns 0 or near-zero for all pages, use tesseract instead
library(pdftools)
test <- pdftools::pdf_text("your_file.pdf")
nchar(test)   # characters per page — 0 means no text layer

Setup

Installing Packages

Code
# Run once to install — comment out after installation
install.packages("pdftools")
install.packages("tesseract")
install.packages("tidyverse")
install.packages("here")
install.packages("hunspell")
install.packages("flextable")
System Dependencies for tesseract

The tesseract R package is a wrapper around the Tesseract OCR engine, which must be installed separately as a system library before the R package will work.

  • Windows: Download and run the installer from github.com/UB-Mannheim/tesseract/wiki. After installation, make sure the Tesseract binary folder is on your system PATH.
  • macOS: Run brew install tesseract in a Terminal (requires Homebrew).
  • Linux (Debian/Ubuntu): Run sudo apt-get install tesseract-ocr in a Terminal.

Additional language packs for non-English OCR can be installed separately — see the Language Support section below.
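
Once the system library is installed, a quick check from R confirms that the wrapper can find it (this uses tesseract_info(), which reports the engine version and the installed language packs):

Code
# Verify that R can see the Tesseract installation
tesseract::tesseract_info()$version     # engine version string
tesseract::tesseract_info()$available   # installed language packs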

Loading Packages

Code
library(pdftools)    # text-layer PDF extraction and metadata
library(tesseract)   # OCR for image-based PDFs
library(tidyverse)   # dplyr, stringr, purrr
library(here)        # portable file paths
library(hunspell)    # spell checking and correction
library(flextable)   # formatted display tables

# Initialise the English Tesseract engine once and reuse it
eng <- tesseract::tesseract("eng")

Data and Folder Setup

The code in this tutorial assumes the following folder structure within your R project:

your_project/
├── data/
│   └── PDFs/
│       ├── pdf0.pdf   (Wikipedia: Corpus linguistics)
│       ├── pdf1.pdf   (Wikipedia: Language)
│       ├── pdf2.pdf   (Wikipedia: Natural language processing)
│       └── pdf3.pdf   (Wikipedia: Computational linguistics)
└── pdf2txt.qmd

Download the four sample PDF files from the links below and save them in data/PDFs/: pdf0 · pdf1 · pdf2 · pdf3
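
If you prefer to script the download, a sketch along these lines works once you substitute the real links above (the base_url below is a placeholder, not the actual download address):

Code
# Placeholder base URL: replace with the actual links above
base_url <- "https://example.org/data/PDFs"
dir.create(here::here("data", "PDFs"), recursive = TRUE, showWarnings = FALSE)
for (f in paste0("pdf", 0:3, ".pdf")) {
  download.file(url      = paste0(base_url, "/", f),
                destfile = here::here("data", "PDFs", f),
                mode     = "wb")   # binary mode is required for PDFs on Windows
}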


Text Extraction with pdftools

Section Overview

What you’ll learn: How to extract text from a single PDF and from a folder of PDFs using pdftools, how to retrieve document metadata and page-level information, how to work with page numbers, and how to save extracted text to disk

The pdftools package (Ooms 2022) provides fast, dependency-light text extraction for PDFs that have an embedded text layer. It wraps the Poppler PDF rendering library, which is bundled with the package on Windows and macOS, so no separate system installation is needed.

Extracting Text from a Single PDF

The workhorse function is pdftools::pdf_text(). It returns a character vector with one element per page; we paste the pages together and collapse any internal whitespace with str_squish().

Code
# Path to the PDF (Wikipedia article on corpus linguistics)
pdf_path <- "tutorials/pdf2txt/data/PDFs/pdf0.pdf"

# Extract text: one element per page
pages <- pdftools::pdf_text(pdf_path)
cat("Pages extracted:", length(pages), "\n")
Pages extracted: 2 
Code
# Collapse all pages into a single string and clean whitespace
txt_output <- pages |>
  paste0(collapse = " ") |>
  stringr::str_squish()

substr(txt_output, 1, 1000)

Corpus linguistics - Wikipedia https://en.wikipedia.org/wiki/Corpus_linguistics Corpus linguistics Corpus linguistics is the study of language as expressed in corpora (samples) of "real world" text. Corpus linguistics proposes that reliable language analysis is more feasible with corpora collected in the field in its natural context ("realia"), and with minimal experimental-interference. The field of corpus linguistics features divergent views about the value of corpus annotation. These views range from John McHardy Sinclair, who advocates minimal annotation so texts speak for themselves,[1] to the Survey of English Usage team (University College, London), who advocate annotation as allowing greater linguistic understanding through rigorous recording.[2] The text-corpus method is a digestive approach that derives a set of abstract rules that govern a natural language from texts in that language, and explores how that language relates to other languages. Originally derived manually, cor

Working Page by Page

Sometimes it is more useful to keep the page structure rather than collapsing everything into one string — for instance, when you need to track which page a quote came from, or when processing very large PDFs that would be unwieldy as a single string. In that case, work with the pages vector directly:

Code
# Process pages individually: clean each page separately
pages_clean <- pages |>
  purrr::map_chr(stringr::str_squish)

# Inspect the second page
cat(pages_clean[2])

# Create a data frame with one row per page
page_df <- data.frame(
  page = seq_along(pages_clean),
  text = pages_clean
)

Extracting Document Metadata

pdftools::pdf_info() returns a rich list of document metadata: title, author, creation date, modification date, PDF version, page dimensions, and more. This information is useful for provenance tracking and for verifying that you have the right document.

Code
meta <- pdftools::pdf_info(pdf_path)

# Display selected metadata fields
data.frame(
  Field = c("Pages", "PDF version", "Title", "Author",
            "Creator", "Created", "Modified"),
  Value = c(
    meta$pages,
    meta$version,
    ifelse(is.null(meta$keys$Title),   "—", meta$keys$Title),
    ifelse(is.null(meta$keys$Author),  "—", meta$keys$Author),
    ifelse(is.null(meta$keys$Creator), "—", meta$keys$Creator),
    format(meta$created,  "%Y-%m-%d %H:%M"),
    format(meta$modified, "%Y-%m-%d %H:%M")
  )
) |>
  flextable::flextable() |>
  flextable::set_table_properties(width = .75, layout = "autofit") |>
  flextable::theme_zebra() |>
  flextable::fontsize(size = 10) |>
  flextable::set_caption(caption = "Document metadata extracted from pdf0.pdf.") |>
  flextable::border_outer()

Field         Value
Pages         2
PDF version   1.7
Title         Corpus linguistics - Wikipedia
Author        marti
Creator       PDF24 Creator
Created       2020-06-09 15:53
Modified      2020-06-09 15:53

Extracting Page-Level Information

pdftools::pdf_pagesize() returns the width and height of each page (in points, where 1 point = 1/72 inch). This is useful for detecting mixed-orientation documents (portrait and landscape pages) or for understanding the physical layout of tables and figures.

pdftools::pdf_fonts() lists the fonts embedded in the document — helpful for diagnosing encoding problems or unusual character sets.

Code
# Page dimensions (in points)
page_sizes <- pdftools::pdf_pagesize(pdf_path)
head(page_sizes, 3)
  top right bottom left width height
1   0   612    792    0   612    792
2   0   612    792    0   612    792
Code
# Embedded fonts
fonts <- pdftools::pdf_fonts(pdf_path)
head(fonts, 6)
# A tibble: 6 × 4
  name                  type     embedded file 
  <chr>                 <chr>    <lgl>    <chr>
1 EXUXUG+Georgia,Bold   truetype TRUE     ""   
2 ZTNEET+Arial,Bold     truetype TRUE     ""   
3 OCIQLU+Georgia        truetype TRUE     ""   
4 AWFSJZ+Arial          truetype TRUE     ""   
5 FQLOPB+Georgia,Italic truetype TRUE     ""   
6 CSYVUW+TimesNewRoman  truetype TRUE     ""   
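
Because pdf_pagesize() returns plain numeric columns, detecting mixed-orientation documents is a one-line comparison:

Code
# Flag landscape pages (width > height); returns integer(0) here,
# as both pages of pdf0.pdf are portrait
which(page_sizes$width > page_sizes$height)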

Extracting Text from Many PDFs

For batch processing, we write a reusable function that takes a folder path, finds all PDF files, extracts and cleans their text, and returns a named character vector.

Code
convertpdf2txt <- function(dirpath, pattern = "\\.pdf$") {
  files <- list.files(dirpath, pattern = pattern,
                      full.names = TRUE, ignore.case = TRUE)
  if (length(files) == 0) stop("No PDF files found in: ", dirpath)

  texts <- sapply(files, function(f) {
    pdftools::pdf_text(f) |>
      paste0(collapse = " ") |>
      stringr::str_squish()
  }, USE.NAMES = TRUE)

  # Use clean base names (without path and extension) as names
  names(texts) <- tools::file_path_sans_ext(basename(files))
  return(texts)
}
Code
txts <- convertpdf2txt(here::here("tutorials/pdf2txt/data/PDFs"))
cat("Texts extracted:", length(txts), "\n")
Texts extracted: 4 
Code
cat("Names:", paste(names(txts), collapse = ", "), "\n")
Names: pdf0, pdf1, pdf2, pdf3 

substr(txts, 1, 800)

Corpus linguistics - Wikipedia https://en.wikipedia.org/wiki/Corpus_linguistics Corpus linguistics Corpus linguistics is the study of language as expressed in corpora (samples) of "real world" text. Corpus linguistics proposes that reliable language analysis is more feasible with corpora collected in the field in its natural context ("realia"), and with minimal experimental-interference. The field of corpus linguistics features divergent views about the value of corpus annotation. These views range from John McHardy Sinclair, who advocates minimal annotation so texts speak for themselves,[1] to the Survey of English Usage team (University College, London), who advocate annotation as allowing greater linguistic understanding through rigorous recording.[2] The text-corpus method is a digesti

Language - Wikipedia https://en.wikipedia.org/wiki/Language Language A language is a structured system of communication. Language, in a broader sense, is the method of communication that involves the use of – particularly human – languages.[1][2][3] The scientific study of language is called linguistics. Questions concerning the philosophy of language, such as whether words can represent experience, have been debated at least since Gorgias and Plato in ancient Greece. Thinkers such as Rousseau have argued that language originated from emotions while others like Kant have held that it originated from rational and logical thought. 20th-century philosophers such as Wittgenstein argued that philosophy is really the study of language. Major figures in linguistics include Ferdinand de Saussure a

Natural language processing - Wikipedia https://en.wikipedia.org/wiki/Natural_language_processing Natural language processing Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data. Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation. Contents History Rule-based vs. statistical NLP Major evaluations and tasks Syntax Semantics An automated online assistant Discourse providing customer service on a Speech web page, an example of an Dialogue ap

Computational linguistics - Wikipedia https://en.wikipedia.org/wiki/Computational_linguistics Computational linguistics Computational linguistics is an interdisciplinary field concerned with the statistical or rule-based modeling of natural language from a computational perspective, as well as the study of appropriate computational approaches to linguistic questions. Traditionally, computational linguistics was performed by computer scientists who had specialized in the application of computers to the processing of a natural language. Today, computational linguists often work as members of interdisciplinary teams, which can include regular linguists, experts in the target language, and computer scientists. In general, computational linguistics draws upon the involvement of linguists, compu

Building a Page-Level Data Frame

For downstream corpus analysis it is often useful to have a tidy data frame with one row per page, carrying the document name and page number alongside the text. This structure integrates naturally with dplyr pipelines.

Code
files <- list.files(here::here("tutorials/pdf2txt/data/PDFs"),
                    pattern = "pdf$", full.names = TRUE, ignore.case = TRUE)

page_corpus <- purrr::map_dfr(files, function(f) {
  pages <- pdftools::pdf_text(f)
  data.frame(
    document = tools::file_path_sans_ext(basename(f)),
    page     = seq_along(pages),
    text     = stringr::str_squish(pages),
    stringsAsFactors = FALSE
  )
})

head(page_corpus[, c("document", "page")], 8)
  document page
1     pdf0    1
2     pdf0    2
3     pdf1    1
4     pdf1    2
5     pdf1    3
6     pdf1    4
7     pdf1    5
8     pdf1    6
Code
cat("Total pages across all documents:", nrow(page_corpus), "\n")
Total pages across all documents: 21 

Saving Extracted Texts to Disk

Code
# Save each text as a .txt file in data/txts/
output_dir <- here::here("tutorials/pdf2txt/data/txts")
dir.create(output_dir, showWarnings = FALSE)

lapply(seq_along(txts), function(i) {
  out_path <- file.path(output_dir, paste0(names(txts)[i], ".txt"))
  writeLines(text = txts[[i]], con = out_path)
  message("Saved: ", out_path)
})
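
A quick round-trip check confirms that the saved files reproduce the extracted texts (a minimal sketch using the objects created above):

Code
# Read one saved file back and compare with the in-memory version
check <- paste(readLines(file.path(output_dir, "pdf0.txt")), collapse = " ")
identical(stringr::str_squish(check), txts[["pdf0"]])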

OCR with tesseract

Section Overview

What you’ll learn: How to perform OCR on image-based PDFs using tesseract, how to configure the OCR engine, how to use non-English language models, and how to handle multi-page scanned PDFs

The tesseract package (Ooms 2023) provides R bindings for Google’s Tesseract OCR engine, an open-source OCR system that supports over 100 languages. Unlike pdftools, which reads an embedded text layer, tesseract analyses the image content of each page and attempts to identify characters from their visual appearance. This makes it the right tool for scanned documents, photographs of text, and any PDF where the text is stored as pixels rather than as Unicode characters.

Basic OCR on a Folder of PDFs

Code
fls <- list.files(here::here("tutorials/pdf2txt/data/PDFs"),
                  pattern = "\\.pdf$", full.names = TRUE, ignore.case = TRUE)

ocrs <- sapply(fls, function(x) {
  nm  <- tools::file_path_sans_ext(basename(x))
  txt <- tesseract::ocr(x, engine = eng) |>
    paste0(collapse = " ")
  return(txt)
}, USE.NAMES = TRUE)
Converting page 1 to pdf0_1.png... done!
Converting page 2 to pdf0_2.png... done!
Converting page 1 to pdf1_1.png... done!
Converting page 2 to pdf1_2.png... done!
Converting page 3 to pdf1_3.png... done!
Converting page 4 to pdf1_4.png... done!
Converting page 5 to pdf1_5.png... done!
Converting page 6 to pdf1_6.png... done!
Converting page 7 to pdf1_7.png... done!
Converting page 8 to pdf1_8.png... done!
Converting page 9 to pdf1_9.png... done!
Converting page 10 to pdf1_10.png... done!
Converting page 11 to pdf1_11.png... done!
Converting page 1 to pdf2_1.png... done!
Converting page 2 to pdf2_2.png... done!
Converting page 3 to pdf2_3.png... done!
Converting page 4 to pdf2_4.png... done!
Converting page 1 to pdf3_1.png... done!
Converting page 2 to pdf3_2.png... done!
Converting page 3 to pdf3_3.png... done!
Converting page 4 to pdf3_4.png... done!
Code
names(ocrs) <- tools::file_path_sans_ext(basename(fls))

substr(ocrs, 1, 800)

Corpus linguistics - Wikipedia https://en.wikipedia.org/wiki/Corpus_linguistics
WIKIPEDIA
e e e
Corpus linguistics
Corpus linguistics is the study of language as expressed in corpora (samples) of "real world" text. Corpus linguistics proposes that reliable language analysis is more feasible with corpora collected in the
field in its natural context ("realia"), and with minimal experimental-interference.
The field of corpus linguistics features divergent views about the value of corpus annotation. These views range from John McHardy Sinclair, who advocates minimal annotation so texts speak for
themselves, to the Survey of English Usage team (University College, London), who advocate annotation as allowing greater linguistic understanding through rigorous recording.|2!
The text-corpus method

Language - Wikipedia https://en.wikipedia.org/wiki/Language
WIKIPEDIA
Language
A language is a structured system of communication. Language, in a broader sense, is the method of communication that involves the use of — particularly human —
languages. J[2II3]
The scientific study of language is called linguistics. Questions concerning the philosophy of language, such as whether words can represent experience, have been g
debated at least since Gorgias and Plato in ancient Greece. Thinkers such as Rousseau have argued that language originated from emotions while others like Kant have ~ ‘
held that it originated from rational and logical thought. 20th-century philosophers such as Wittgenstein argued that philosophy is really the study of language. Major Pay yi
figures in linguistics include F

Natural language processing - Wikipedia https://en.wikipedia.org/wiki/Natural_ language processing
WIKIPEDIA
e
Natural language processing
Natural language processing (NLP) is a subfield of linguistics, computer science, information engineering, and artificial intelligence concerned with the interactions —
between computers and human (natural) languages, in particular how to program computers to process and analyze large amounts of natural language data. ce ee EI
Challenges in natural language processing frequently involve speech recognition, natural language understanding, and natural language generation. —— oo
- &
Contents nae
History eicseuuc
Rule-based vs. statistical NLP chm
Hi I'm your automated online ‘ter coremer s
Major evaluations and tasks svat How may nt ou? f= kd
Syntax iereii

Computational linguistics - Wikipedia https://en.wikipedia.org/wiki/Computational_ linguistics
WIKIPEDIA
e e e e
Computational linguistics
Computational linguistics is an interdisciplinary field concerned with the statistical or rule-based modeling of natural language from a computational perspective, as well as the study of appropriate
computational approaches to linguistic questions.
Traditionally, computational linguistics was performed by computer scientists who had specialized in the application of computers to the processing of a natural language. Today, computational linguists
often work as members of interdisciplinary teams, which can include regular linguists, experts in the target language, and computer scientists. In general, computational linguistics draws upon the
involvement

Language Support

By default, tesseract() uses the English language model ("eng"). For documents in other languages, you must first install the relevant Tesseract language pack and then initialise an engine with that language code.

Code
# List all language packs already installed on your system
tesseract::tesseract_info()$available

# Install additional language packs from within R
# (downloads the trained model data from the tesseract-ocr GitHub repository)
tesseract::tesseract_download("deu")   # German
tesseract::tesseract_download("fra")   # French
tesseract::tesseract_download("zho")   # Chinese (simplified)
tesseract::tesseract_download("ara")   # Arabic
tesseract::tesseract_download("hin")   # Hindi (Devanagari)
Code
# Initialise an engine for a specific language
deu <- tesseract::tesseract("deu")
fra <- tesseract::tesseract("fra")

# OCR a German-language PDF
german_text <- tesseract::ocr("path/to/german_document.pdf", engine = deu)

Multi-Language Documents

If a document contains text in more than one language, you can initialise a combined engine by passing a +-separated language string:

Code
# Engine that handles both English and German
eng_deu <- tesseract::tesseract("eng+deu")
mixed_text <- tesseract::ocr("mixed_language_doc.pdf", engine = eng_deu)

Recognition accuracy decreases somewhat with combined engines compared to single-language engines, so use this only when necessary.

Engine Configuration Options

The tesseract engine exposes many configuration parameters via the options argument of tesseract::tesseract(). The most practically useful are:

Code
# Page segmentation modes (psm) control how Tesseract analyses page layout:
# 1  = Automatic page segmentation with OSD (orientation and script detection)
# 3  = Fully automatic page segmentation (default)
# 6  = Assume a single uniform block of text
# 11 = Sparse text — find as much text as possible in no particular order
# 13 = Raw line — treat the image as a single text line

# OCR engine modes (oem):
# 0 = Legacy Tesseract engine only
# 1 = Neural nets LSTM engine only (best for most modern documents)
# 2 = Legacy + LSTM engines combined
# 3 = Default (based on what is available)

# Example: configure for a clean single-column document
eng_clean <- tesseract::tesseract(
  language = "eng",
  options  = list(
    tessedit_pageseg_mode = 6,   # single uniform block
    tessedit_ocr_engine_mode = 1 # LSTM only (most accurate)
  )
)

# Example: configure for sparse or noisy text (e.g. forms, tables)
eng_sparse <- tesseract::tesseract(
  language = "eng",
  options  = list(tessedit_pageseg_mode = 11)
)

Choosing the Page Segmentation Mode

The default mode (psm = 3, fully automatic) works well for most documents with a standard single- or multi-column layout. Use psm = 6 for clean, uniform text blocks (academic papers, novels). Use psm = 11 for heavily fragmented layouts such as invoices, forms, or partially damaged scans. Use psm = 13 for single lines of text, such as captions or labels.
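
When in doubt, it is cheap to run the same page under two segmentation modes and compare the output (a sketch; the image path is a placeholder, and eng and eng_sparse are the engines defined above):

Code
# Compare the default mode with sparse-text mode on one page image
img_path   <- "path/to/scan_page.png"
txt_auto   <- tesseract::ocr(img_path, engine = eng)         # psm = 3 (default)
txt_sparse <- tesseract::ocr(img_path, engine = eng_sparse)  # psm = 11
cat("default:", nchar(txt_auto),   "characters\n")
cat("sparse: ", nchar(txt_sparse), "characters\n")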

Handling Multi-Page Scanned PDFs

Scanned PDFs often contain many pages, each stored as a raster image. tesseract::ocr() handles multi-page PDFs natively — it renders each page as an image internally and runs OCR on each in sequence. However, for very large documents it can be useful to process pages in parallel or to save intermediate results to disk to avoid having to re-run expensive OCR if the process is interrupted.

Code
# For a large scanned PDF: process page by page and save intermediate results
large_pdf  <- "path/to/large_scanned_document.pdf"
output_dir <- here::here("data", "ocr_pages")
dir.create(output_dir, showWarnings = FALSE)

# Get total number of pages
n_pages <- pdftools::pdf_info(large_pdf)$pages
cat("Total pages:", n_pages, "\n")

# Process each page individually; save to disk as we go
for (i in seq_len(n_pages)) {
  out_file <- file.path(output_dir, sprintf("page_%04d.txt", i))

  # Skip pages already processed (allows resuming after interruption)
  if (file.exists(out_file)) next

  # tesseract::ocr() reads whole files, so render just this page to a
  # temporary PNG with pdftools and run OCR on that image
  img <- pdftools::pdf_convert(large_pdf, pages = i, dpi = 300,
                               filenames = tempfile(fileext = ".png"),
                               verbose = FALSE)
  page_text <- tesseract::ocr(img, engine = eng)
  writeLines(page_text, con = out_file)
  if (i %% 10 == 0) message("Processed page ", i, " of ", n_pages)
}

# Reassemble all pages into a single text
page_files <- list.files(output_dir, pattern = "\\.txt$",
                         full.names = TRUE)
full_text  <- sapply(page_files, readLines) |>
  unlist() |>
  paste0(collapse = " ") |>
  stringr::str_squish()
Code
# Alternative: parallel processing with furrr (requires the furrr package)
# install.packages("furrr")
library(furrr)
future::plan("multisession", workers = 4)   # use 4 CPU cores

n_pages    <- pdftools::pdf_info(large_pdf)$pages
page_texts <- furrr::future_map_chr(
  seq_len(n_pages),
  function(i) {
    # Tesseract engines are external pointers and cannot be shared across
    # worker processes, so initialise one inside each call
    engine <- tesseract::tesseract("eng")
    img <- pdftools::pdf_convert(large_pdf, pages = i, dpi = 300,
                                 filenames = tempfile(fileext = ".png"),
                                 verbose = FALSE)
    paste0(tesseract::ocr(img, engine = engine), collapse = " ")
  },
  .progress = TRUE
)
full_text_parallel <- paste0(page_texts, collapse = " ")

OCR Is Slow

Tesseract processes roughly 1–5 pages per minute depending on page resolution, image quality, page segmentation mode, and hardware. At that rate, a 200-page scanned book can take anywhere from about 40 minutes to several hours. Always save intermediate results page-by-page (as shown above) so that you can resume without reprocessing completed pages if R crashes or times out.
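
To budget a long run, time a single page first and extrapolate (a sketch using the large_pdf, n_pages, and eng objects defined above):

Code
# Time one page of OCR and extrapolate to the whole document
t1 <- system.time({
  img <- pdftools::pdf_convert(large_pdf, pages = 1, dpi = 300,
                               filenames = tempfile(fileext = ".png"),
                               verbose = FALSE)
  tesseract::ocr(img, engine = eng)
})
cat("Estimated total:", round(t1[["elapsed"]] * n_pages / 60, 1), "minutes\n")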

Pre-Processing Images to Improve OCR Accuracy

OCR accuracy depends heavily on image quality. For documents that produce poor results, pre-processing the page images before OCR can substantially improve recognition. The magick package (a wrapper around ImageMagick) provides the tools most commonly needed.

Code
# install.packages("magick")
library(magick)

improve_ocr <- function(pdf_path, engine = eng) {
  # Convert PDF pages to high-resolution PNG images
  imgs <- magick::image_read_pdf(pdf_path, density = 300)  # 300 dpi

  page_texts <- sapply(seq_along(imgs), function(i) {
    img <- imgs[i] |>
      magick::image_convert(type = "Grayscale") |>    # convert to greyscale
      magick::image_contrast(sharpen = 1) |>          # enhance contrast
      magick::image_despeckle() |>                    # remove noise
      magick::image_deskew(threshold = 40)            # straighten tilted pages

    # Run OCR on the pre-processed image
    tesseract::ocr(img, engine = engine)
  })

  paste0(page_texts, collapse = " ")
}

# Apply to a scanned PDF
clean_text <- improve_ocr("path/to/noisy_scan.pdf", engine = eng)

Most Impactful Pre-Processing Steps

In order of typical impact on OCR accuracy:

  1. Resolution — scan/render at 300 dpi minimum; 400–600 dpi for small or degraded fonts
  2. Deskew — correct page rotation introduced during scanning
  3. Greyscale conversion — remove colour information that can confuse character detection
  4. Contrast enhancement — improve separation between ink and background
  5. Despeckle — remove noise from scanner sensors or damaged paper
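
The resolution point is easy to verify: render the same page at two densities and compare how much text Tesseract recovers (a sketch; the file path is a placeholder):

Code
# More dots per inch usually means more recognised characters
low  <- magick::image_read_pdf("path/to/noisy_scan.pdf", pages = 1, density = 150)
high <- magick::image_read_pdf("path/to/noisy_scan.pdf", pages = 1, density = 400)
nchar(tesseract::ocr(low,  engine = eng))
nchar(tesseract::ocr(high, engine = eng))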

Spell Checking and Correction

Section Overview

What you’ll learn: How to check OCR output for non-dictionary words using hunspell, how to generate and apply automated spelling suggestions, and how to identify and review the most frequent OCR errors

Even high-quality OCR produces errors — especially for degraded documents, unusual fonts, or non-standard layouts. Common OCR error patterns include: l mistaken for 1 or I, rn mistaken for m, cl mistaken for d, and hyphenated line-break artefacts. Automated spell-checking cannot catch all errors (particularly proper nouns, technical terms, or correctly spelled but contextually wrong words), but it is a fast and effective first pass for cleaning OCR output.
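
One of these patterns, the hyphenated line-break artefact, can be repaired with a simple regular expression before tokenising (a sketch; note that it will also join legitimately hyphenated words that break across lines, so apply it with care):

Code
# Rejoin words split across line breaks: "lin- guistics" -> "linguistics"
fix_hyphenation <- function(text) {
  stringr::str_replace_all(text, "(\\w)-\\s+(\\w)", "\\1\\2")
}
fix_hyphenation("corpus lin- guistics and text analy- sis")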

Tokenising and Checking Spelling

hunspell::hunspell_parse() splits text into word tokens. hunspell::hunspell_check() returns TRUE for each token that is found in the dictionary and FALSE for each token that is not.

Code
# Tokenise OCR output into word vectors (one vector per document)
tokens_ocr <- lapply(ocrs, function(x) {
  hunspell::hunspell_parse(x, dict = hunspell::dictionary("en_US"))[[1]]
})

# How many tokens per document?
sapply(tokens_ocr, length)
  pdf0   pdf1   pdf2   pdf3 
  2101  15987   4617   5687 
Code
# Check which tokens are in the dictionary
spelling_check <- lapply(tokens_ocr, function(toks) {
  data.frame(
    token   = toks,
    correct = hunspell::hunspell_check(toks,
                dict = hunspell::dictionary("en_US"))
  )
})

# Proportion of correctly spelled tokens per document
sapply(spelling_check, function(x) round(mean(x$correct) * 100, 1))
 pdf0  pdf1  pdf2  pdf3 
 86.6  92.9  92.3  87.9 

Reviewing the Most Frequent Errors

Before applying any automated correction, it is worth inspecting the most frequent non-dictionary tokens. Many will be proper nouns, technical terms, or hyphenated compounds that are perfectly correct — these should be added to an ignore list rather than corrected.

Code
# Collect all non-dictionary tokens across all documents
all_errors <- lapply(spelling_check, function(x) {
  x$token[!x$correct]
}) |>
  unlist()

# Frequency table of the 20 most common non-dictionary tokens
error_freq <- sort(table(all_errors), decreasing = TRUE)

data.frame(
  token = names(error_freq),
  count = as.integer(error_freq)
) |>
  head(20) |>
  flextable::flextable() |>
  flextable::set_table_properties(width = .5, layout = "autofit") |>
  flextable::theme_zebra() |>
  flextable::fontsize(size = 10) |>
  flextable::set_caption(
    caption = "20 most frequent non-dictionary tokens across all OCR outputs."
  ) |>
  flextable::border_outer()

token        count
https        160
http         79
doi          71
www          71
edu          31
wikipedia    29
von          26
nih          21
Trask        21
ncbi         20
de           17
NLP          17
ae           16
ee           14
PMC          14
Awww         13
html         13
nim          13
pdf          13
PMID         13
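
Several of the tokens above (wikipedia, NLP, PMID, Trask) are legitimate and belong on an ignore list rather than in a correction dictionary. hunspell supports this directly via the add_words argument of dictionary(); the words below are examples only, to be extended based on your own review:

Code
# Extend the dictionary so domain terms no longer count as errors
ignore    <- c("wikipedia", "doi", "ncbi", "NLP", "PMID", "Trask", "https", "www")
dict_plus <- hunspell::dictionary("en_US", add_words = ignore)
hunspell::hunspell_check(c("NLP", "Trask", "qzxw"), dict = dict_plus)
# expected: TRUE TRUE FALSE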

Generating Spelling Suggestions

hunspell::hunspell_suggest() returns a list of candidate corrections for each non-dictionary token, ranked by edit distance from the input. We take the first (best) suggestion where one is available.

Code
# Get suggestions for the 20 most common errors
top_errors <- names(error_freq)[1:20]

suggestions <- hunspell::hunspell_suggest(
  top_errors,
  dict = hunspell::dictionary("en_US")
)

# Build a review table: original token + best suggestion
suggestion_df <- data.frame(
  token      = top_errors,
  suggestion = sapply(suggestions, function(s) {
    if (length(s) == 0) NA_character_ else s[1]
  })
)

suggestion_df |>
  flextable::flextable() |>
  flextable::set_table_properties(width = .5, layout = "autofit") |>
  flextable::theme_zebra() |>
  flextable::fontsize(size = 10) |>
  flextable::set_caption(
    caption = "Top 20 non-dictionary tokens with best hunspell correction suggestion."
  ) |>
  flextable::border_outer()

token        suggestion
https        HTTP
http         HTTP
doi          dpi
www          WWW
edu          ed
wikipedia    Wikipedia
von          con
nih          NIH
Trask        Task
ncbi         cabin
de           DE
NLP          NIP
ae           eye
ee           i
PMC          PM
Awww         WWW
html         HTML
nim          min
pdf          PDF
PMID         MID

Always Review Before Applying

Automated suggestions should be reviewed before being applied. hunspell_suggest() picks candidates purely based on character edit distance — it has no knowledge of context and will frequently suggest plausible-looking but wrong corrections. For example, the OCR error cornputer might be correctly suggested as computer, but rnodels might be suggested as models or noodles with equal confidence. Always check the suggestion table manually and build a curated correction dictionary for your specific document type.

Applying a Curated Correction Dictionary

The recommended workflow is to review the suggestion table, manually confirm or override each correction, and then apply the full set of corrections as a batch string replacement.

Code
# Define a curated correction dictionary after manual review
# (example entries — adjust based on your actual OCR errors)
correction_dict <- c(
  "cornputer"  = "computer",
  "languagc"   = "language",
  "analysls"   = "analysis",
  "iinguistics" = "linguistics",
  "processlng" = "processing"
)

# Apply corrections to all OCR texts
apply_corrections <- function(text, dict) {
  for (wrong in names(dict)) {
    text <- stringr::str_replace_all(
      text,
      pattern     = paste0("\\b", wrong, "\\b"),
      replacement = dict[[wrong]]
    )
  }
  return(text)
}

corrected_texts <- sapply(ocrs, apply_corrections, dict = correction_dict)

Simple Automated Correction (Aggressive Mode)

If you prefer a fully automated approach and are willing to accept some incorrect corrections in exchange for speed, the following pipeline replaces every non-dictionary token with the best available suggestion. Use this with caution on documents containing technical vocabulary, proper names, or non-standard spellings.

Code
# Automated correction: replace every non-dictionary token with best suggestion
# WARNING: will incorrectly "correct" proper nouns and technical terms
clean_ocrtext <- sapply(tokens_ocr, function(toks) {
  correct  <- hunspell::hunspell_check(toks,
                dict = hunspell::dictionary("en_US"))
  suggs    <- hunspell::hunspell_suggest(toks[!correct],
                dict = hunspell::dictionary("en_US"))
  # Replace non-dictionary tokens with first suggestion (if available)
  toks[!correct] <- sapply(suggs, function(s) {
    if (length(s) == 0) NA_character_ else s[1]
  })
  # Remove tokens for which no suggestion was found
  toks <- toks[!is.na(toks)]
  paste0(toks, collapse = " ")
})

substr(clean_ocrtext, 1, 800)

Corpus linguistics Wikipedia HTTP en Wikipedia org wiki Corpus linguistics WIKIPEDIA e e e Corpus linguistics Corpus linguistics is the study of language as expressed in corpora samples of real world text Corpus linguistics proposes that reliable language analysis is more feasible with corpora collected in the field in its natural context regalia and with minimal experimental interference The field of corpus linguistics features divergent views about the value of corpus annotation These views range from John Hardy Sinclair who advocates minimal annotation so texts speak for themselves to the Survey of English Usage team University College London who advocate annotation as allowing greater linguistic understanding through rigorous recording The text corpus method is a digestive approach tha

Language Wikipedia HTTP en Wikipedia org wiki Language WIKIPEDIA Language A language is a structured system of communication Language in a broader sense is the method of communication that involves the use of particularly human languages J II The scientific study of language is called linguistics Questions concerning the philosophy of language such as whether words can represent experience have been g debated at least since Gorgas and Plato in ancient Greece Thinkers such as Rousseau have argued that language originated from emotions while others like Kant have held that it originated from rational and logical thought ht century philosophers such as Wittgenstein argued that philosophy is really the study of language Major Pay ti figures in linguistics include Ferdinand DE Saussure and Nam

Natural language processing Wikipedia HTTP en Wikipedia org wiki Natural language processing WIKIPEDIA e Natural language processing Natural language processing NIP is a sub field of linguistics computer science information engineering and artificial intelligence concerned with the interactions between computers and human natural languages in particular how to program computers to process and analyze large amounts of natural language data Ce i A Challenges in natural language processing frequently involve speech recognition natural language understanding and natural language generation u Contents nae History eugenics Rule based vs statistical NIP chm Hi I'm your automated online yer corer s Major evaluations and tasks scat How may NT oi f ks Syntax portieres Semantics An automated online a

Computational linguistics Wikipedia HTTP en Wikipedia org wiki Computational linguistics WIKIPEDIA e e e e Computational linguistics Computational linguistics is an interdisciplinary field concerned with the statistical or rule based modeling of natural language from a computational perspective as well as the study of appropriate computational approaches to linguistic questions Traditionally computational linguistics was performed by computer scientists who had specialized in the application of computers to the processing of a natural language Today computational linguists often work as members of interdisciplinary teams which can include regular linguists experts in the target language and computer scientists In general computational linguistics draws upon the involvement of linguists com


Putting It All Together

Section Overview

What you’ll learn: A complete, production-ready workflow function that selects the appropriate extraction method (pdftools or tesseract), extracts text, and optionally applies spell correction — all in a single call

The code below wraps the full pipeline into a single reusable function. It accepts a path to a PDF or a directory of PDFs, detects whether each file has an embedded text layer (and falls back to tesseract if not), and optionally applies spell correction.

Code
#' Extract text from one or more PDFs, choosing the best method automatically
#'
#' @param path  Path to a single PDF file or a directory containing PDFs
#' @param lang  Tesseract language code (default: "eng")
#' @param spell_correct  Apply automated spell correction to OCR output?
#' @param min_chars_per_page  Minimum characters per page to consider text
#'                            layer valid (below this, fall back to tesseract)
#' @return Named character vector of extracted texts
extract_pdf_text <- function(path,
                             lang               = "eng",
                             spell_correct      = FALSE,
                             min_chars_per_page = 50) {

  # Resolve input: single file or directory
  if (dir.exists(path)) {
    files <- list.files(path, pattern = "\\.pdf$",
                        full.names = TRUE, ignore.case = TRUE)
  } else if (file.exists(path)) {
    files <- path
  } else {
    stop("Path does not exist: ", path)
  }

  engine <- tesseract::tesseract(lang)

  results <- sapply(files, function(f) {

    # Try pdftools first; check whether the text layer is usable
    pages_raw   <- pdftools::pdf_text(f)
    avg_chars   <- mean(nchar(stringr::str_squish(pages_raw)))
    has_textlayer <- avg_chars >= min_chars_per_page

    if (has_textlayer) {
      message(basename(f), ": using pdftools (text layer detected)")
      txt <- pages_raw |>
        paste0(collapse = " ") |>
        stringr::str_squish()
    } else {
      message(basename(f), ": using tesseract (no usable text layer)")
      txt <- tesseract::ocr(f, engine = engine) |>
        paste0(collapse = " ") |>
        stringr::str_squish()

      if (spell_correct) {
        toks    <- hunspell::hunspell_parse(txt,
                     dict = hunspell::dictionary("en_US"))[[1]]
        correct <- hunspell::hunspell_check(toks,
                     dict = hunspell::dictionary("en_US"))
        suggs   <- hunspell::hunspell_suggest(toks[!correct],
                     dict = hunspell::dictionary("en_US"))
        toks[!correct] <- sapply(suggs, function(s) {
          if (length(s) == 0) NA_character_ else s[1]
        })
        toks <- toks[!is.na(toks)]
        txt  <- paste0(toks, collapse = " ")
      }
    }
    return(txt)
  }, USE.NAMES = TRUE)

  names(results) <- tools::file_path_sans_ext(basename(files))
  return(results)
}

# --- Usage examples ----------------------------------------------------------

# Single file — auto-detect method
text1 <- extract_pdf_text("tutorials/pdf2txt/data/PDFs/pdf0.pdf")

# Directory — auto-detect method for each file
texts_auto <- extract_pdf_text("tutorials/pdf2txt/data/PDFs/")

# Directory — force spell correction for OCR fallback files
texts_corrected <- extract_pdf_text("tutorials/pdf2txt/data/PDFs/", spell_correct = TRUE)

# Non-English document
text_de <- extract_pdf_text("tutorials/pdf2txt/data/PDFs/german_report.pdf", lang = "deu")

Summary

This how-to has covered the complete PDF-to-text workflow in R:

Choosing a tool. pdftools is the right choice for digitally generated PDFs with an embedded text layer — it is fast, requires no external dependencies beyond Poppler, and preserves the document’s pagination and layout. tesseract is the right choice for scanned documents and image-based PDFs — it is slower but handles content that pdftools cannot access at all.

Beyond basic extraction. pdftools also provides document metadata (pdf_info()), page dimensions (pdf_pagesize()), and font information (pdf_fonts()), all of which are useful for provenance tracking and diagnosing encoding problems. tesseract supports over 100 languages via downloadable language models and exposes configuration parameters for page segmentation mode and OCR engine selection that can significantly improve accuracy on challenging documents.

Pre-processing and spell correction. For noisy scans, pre-processing the page images with magick (greyscale conversion, contrast enhancement, deskewing, despeckling) before OCR substantially improves recognition accuracy. Post-OCR spell checking with hunspell identifies non-dictionary tokens and can generate correction candidates, but automated correction should always be reviewed manually before application — particularly for documents containing proper nouns, technical vocabulary, or non-standard spelling conventions.

Production-ready workflow. The extract_pdf_text() function presented in the final section wraps the full pipeline into a single call that automatically detects whether each PDF has a usable text layer and selects the appropriate extraction method accordingly.


Citation and Session Info

Schweinberger, Martin. 2026. Converting PDFs to Text with R. Brisbane: The Language Technology and Data Analysis Laboratory (LADAL). url: https://ladal.edu.au/tutorials/pdf2txt/pdf2txt.html (Version 2026.02.24).

@manual{schweinberger2026pdf2txt,
  author       = {Schweinberger, Martin},
  title        = {Converting PDFs to Text with R},
  note         = {https://ladal.edu.au/tutorials/pdf2txt/pdf2txt.html},
  year         = {2026},
  organization = {The Language Technology and Data Analysis Laboratory (LADAL)},
  address      = {Brisbane},
  edition      = {2026.02.24}
}
Code
sessionInfo()
R version 4.4.2 (2024-10-31 ucrt)
Platform: x86_64-w64-mingw32/x64
Running under: Windows 11 x64 (build 26200)

Matrix products: default


locale:
[1] LC_COLLATE=English_United States.utf8 
[2] LC_CTYPE=English_United States.utf8   
[3] LC_MONETARY=English_United States.utf8
[4] LC_NUMERIC=C                          
[5] LC_TIME=English_United States.utf8    

time zone: Australia/Brisbane
tzcode source: internal

attached base packages:
[1] stats     graphics  grDevices datasets  utils     methods   base     

other attached packages:
 [1] flextable_0.9.11 hunspell_3.0.5   here_1.0.2       lubridate_1.9.4 
 [5] forcats_1.0.0    stringr_1.5.1    dplyr_1.2.0      purrr_1.0.4     
 [9] readr_2.1.5      tidyr_1.3.2      tibble_3.2.1     ggplot2_4.0.2   
[13] tidyverse_2.0.0  tesseract_5.2.2  pdftools_3.4.1  

loaded via a namespace (and not attached):
 [1] utf8_1.2.4              rappdirs_0.3.3          generics_0.1.3         
 [4] fontLiberation_0.1.0    renv_1.1.7              xml2_1.3.6             
 [7] stringi_1.8.4           hms_1.1.3               digest_0.6.39          
[10] magrittr_2.0.3          evaluate_1.0.3          grid_4.4.2             
[13] timechange_0.3.0        RColorBrewer_1.1-3      fastmap_1.2.0          
[16] rprojroot_2.1.1         jsonlite_1.9.0          zip_2.3.2              
[19] scales_1.4.0            fontBitstreamVera_0.1.1 textshaping_1.0.0      
[22] codetools_0.2-20        cli_3.6.4               rlang_1.1.7            
[25] fontquiver_0.2.1        withr_3.0.2             yaml_2.3.10            
[28] gdtools_0.5.0           officer_0.7.3           tools_4.4.2            
[31] uuid_1.2-1              tzdb_0.4.0              vctrs_0.7.1            
[34] R6_2.6.1                lifecycle_1.0.5         htmlwidgets_1.6.4      
[37] ragg_1.3.3              pkgconfig_2.0.3         pillar_1.10.1          
[40] gtable_0.3.6            glue_1.8.0              data.table_1.17.0      
[43] Rcpp_1.1.1              systemfonts_1.3.1       xfun_0.56              
[46] tidyselect_1.2.1        rstudioapi_0.17.1       knitr_1.51             
[49] farver_2.1.2            patchwork_1.3.0         htmltools_0.5.9        
[52] rmarkdown_2.30          qpdf_1.3.4              compiler_4.4.2         
[55] S7_0.2.1           askpass_1.2.1           openssl_2.3.2          

AI Transparency Statement

This how-to was revised and substantially expanded with the assistance of Claude (claude.ai), a large language model created by Anthropic. Claude was used to restructure the document into Quarto format, add the pdftools vs tesseract comparison section, expand the pdftools section with metadata, page-level, and batch-processing examples, expand the tesseract section with language support and engine configuration, write the new multi-page scanned PDF and image pre-processing sections, expand the spell-checking section with a suggested-correction workflow and curated dictionary approach, and write the production-ready extract_pdf_text() wrapper function. All content was reviewed, edited, and approved by the author (Martin Schweinberger), who takes full responsibility for the accuracy of the material.




References

Ooms, Jeroen. 2022. pdftools: Text Extraction, Rendering and Converting of PDF Documents. R package. https://CRAN.R-project.org/package=pdftools

Ooms, Jeroen. 2023. tesseract: Open Source OCR Engine. R package. https://CRAN.R-project.org/package=tesseract